83 research outputs found

    Transfer from Multiple MDPs

    Transfer reinforcement learning (RL) methods leverage the experience collected on a set of source tasks to speed up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a given target task. In this paper, we investigate the theoretical properties of this transfer method and introduce novel algorithms that adapt the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem. Comment: 201

    A Novel Confidence-Based Algorithm for Structured Bandits

    We study finite-armed stochastic bandits where the rewards of each arm might be correlated with those of other arms. We introduce a novel phased algorithm that exploits the given structure to build confidence sets over the parameters of the true bandit problem and rapidly discard all sub-optimal arms. In particular, unlike standard bandit algorithms with no structure, we show that the number of times a suboptimal arm is selected may actually be reduced thanks to the information collected by pulling other arms. Furthermore, we show that, in some structures, the regret of an anytime extension of our algorithm is uniformly bounded over time. For these constant-regret structures, we also derive a matching lower bound. Finally, we demonstrate numerically that our approach better exploits certain structures than existing methods. Comment: AISTATS 202

    Estimating the maximum expected value in continuous reinforcement learning problems

    This paper is about the estimation of the maximum expected value of an infinite set of random variables. This estimation problem is relevant in many fields, such as Reinforcement Learning (RL). In RL it is well known that, in some stochastic environments, a bias in the estimation error can increase the approximation error step by step, leading to large overestimates of the true action values. Recently, some approaches have been proposed to reduce such bias in order to get better action-value estimates, but they are limited to finite problems. In this paper, we leverage the recently proposed weighted estimator and Gaussian process regression to derive a new method that is able to natively handle infinitely many random variables. We show how these techniques can be used to address RL problems with both continuous states and continuous actions. To evaluate the effectiveness of the proposed approach, we perform empirical comparisons with related approaches.
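
The overestimation mentioned in this abstract can be seen in a toy setting. The following is a minimal sketch, not the paper's method: it assumes a hypothetical finite set of Gaussian variables that all have true mean 0, so the true maximum expected value is 0, and it measures how far the maximum of the sample means sits above that value on average. The function names and all parameter values are illustrative assumptions.

```python
import random
import statistics


def max_of_sample_means(n_vars, n_samples, rng):
    """Naive estimator: the largest sample mean among n_vars variables.

    Every variable is drawn from N(0, 1), so the true maximum expected
    value is exactly 0 (an assumption made only for this illustration).
    """
    means = []
    for _ in range(n_vars):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        means.append(statistics.fmean(draws))
    return max(means)


def average_estimate(n_vars=10, n_samples=20, n_trials=2000, seed=0):
    """Average the naive estimator over many independent trials."""
    rng = random.Random(seed)
    estimates = [
        max_of_sample_means(n_vars, n_samples, rng) for _ in range(n_trials)
    ]
    return statistics.fmean(estimates)


if __name__ == "__main__":
    # With one variable the estimate is unbiased; with ten it drifts upward.
    print("1 variable  :", average_estimate(n_vars=1))
    print("10 variables:", average_estimate(n_vars=10))
```

With a single variable the average stays close to 0, while with ten variables it is clearly positive even though every true mean is 0. This upward bias is the effect that bias-reducing estimators aim to counter.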

    Learning in Non-Cooperative Configurable Markov Decision Processes

    The Configurable Markov Decision Process framework includes two entities: a Reinforcement Learning agent and a configurator that can modify some environmental parameters to improve the agent's performance. This presupposes that the two actors have the same reward functions. What if the configurator does not have the same intentions as the agent? This paper introduces the Non-Cooperative Configurable Markov Decision Process, a setting that allows having two (possibly different) reward functions for the configurator and the agent. Then, we consider an online learning problem, where the configurator has to find the best among a finite set of possible configurations. We propose two learning algorithms that minimize the configurator's expected regret by exploiting the problem's structure, depending on the agent's feedback. While a naive application of the UCB algorithm yields a regret that grows indefinitely over time, we show that our approach suffers only bounded regret. Furthermore, we empirically evaluate the performance of our algorithms in simulated domains.

    Best Arm Identification for Stochastic Rising Bandits

    Stochastic Rising Bandits is a setting in which the values of the expected rewards of the available options increase every time they are selected. This framework models a wide range of scenarios in which the available options are learning entities whose performance improves over time. In this paper, we focus on the Best Arm Identification (BAI) problem for stochastic rested rising bandits. In this scenario, we are asked, given a fixed budget of rounds, to provide a recommendation about the best option at the end of the selection process. We propose two algorithms to tackle this setting, namely R-UCBE, which resorts to a UCB-like approach, and R-SR, which employs a successive reject procedure. We show that they provide guarantees on the probability of properly identifying the optimal option at the end of the learning process. Finally, we numerically validate the proposed algorithms in synthetic and realistic environments and compare them with the currently available BAI strategies.

    A Nanocryotron Ripple Counter Integrated with a Superconducting Nanowire Single-Photon Detector for Megapixel Arrays

    Decreasing the number of cables that bring heat into the cryocooler is a critical issue for all cryoelectronic devices. In particular, arrays of superconducting nanowire single-photon detectors (SNSPDs) could require more than $10^6$ readout lines. Performing signal processing operations at low temperatures could be a solution. Nanocryotrons, superconducting nanowire three-terminal devices, are good candidates for integrating sensing and electronics on the same technological platform as SNSPDs in photon-counting applications. In this work, we demonstrated that it is possible to read out, process, encode, and store the output of SNSPDs using exclusively superconducting nanowires. In particular, we present the design and development of a nanocryotron ripple counter that detects input voltage spikes and converts the number of pulses to an $N$-digit value. The counting base can be tuned from 2 to higher values, enabling higher maximum counts without enlarging the circuit. As a proof-of-principle, we first experimentally demonstrated the building block of the counter, an integer-$N$ frequency divider with $N$ ranging from 2 to 5. Then, we demonstrated photon-counting operations at 405 nm and 1550 nm by coupling an SNSPD with a 2-digit nanocryotron counter partially integrated on-chip. The 2-digit counter operated in either base 2 or base 3 with a bit error rate lower than $2 \times 10^{-4}$ and a maximum count rate of $45 \times 10^6\,\mathrm{s}^{-1}$. We simulated circuit architectures for integrated readout of the counter state, and we evaluated the capabilities of reading out an SNSPD megapixel array that would collect up to $10^{12}$ counts per second. The results of this work, combined with our recent publications on a nanocryotron shift register and logic gates, pave the way for the development of nanocryotron processors, from which multiple superconducting platforms may benefit.
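
The remark that a tunable counting base raises the maximum count without enlarging the circuit can be illustrated with a purely behavioral model. The sketch below is an assumption-laden illustration, not the authors' circuit: it treats each digit as an ideal divide-by-`base` stage that resets and passes one carry pulse to the next stage when it overflows. The names `ripple_count` and `max_count` are hypothetical.

```python
def ripple_count(n_pulses, base, n_digits):
    """Behavioral model of an n_digits ripple counter in the given base.

    Each input pulse increments the least significant digit. A digit that
    reaches `base` resets to 0 and sends one carry pulse to the next digit.
    Returns the digits, least significant first.
    """
    digits = [0] * n_digits
    for _ in range(n_pulses):
        for i in range(n_digits):
            digits[i] += 1
            if digits[i] < base:
                break          # no overflow, the carry stops here
            digits[i] = 0      # overflow: reset and ripple to next digit
        # If every digit overflowed, the counter has wrapped back to zero.
    return digits


def max_count(base, n_digits):
    """Largest value the counter can hold before wrapping around."""
    return base ** n_digits - 1


if __name__ == "__main__":
    for base in (2, 3):
        print(f"base {base}: max count {max_count(base, 2)},",
              f"5 pulses -> {ripple_count(5, base, 2)}")
```

In this model a 2-digit counter holds at most 3 counts in base 2 and 8 counts in base 3, with the same number of stages. Five pulses wrap a base-2 counter around once, whereas the base-3 counter still represents the count exactly.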